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Abstract. — ^While the majority of gene histories found in a clade of organisms are expected 
to be generated by a common process (e.g. the coalescent process), it is well-known that 
numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and 
subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct 
from the histories of the majority of genes. Such "outlying" gene trees are considered to be 
biologically interesting and identifying these genes has become an important problem in 
phylogenetics. In this paper we propose and implement a nonparametric method of 
estimating distributions of phylogenetic trees, with the goal of identifying trees which are 
significantly different from the rest of the trees in the sample. 

Our approach mimics the common statistical technique of kernel density estimation, 
using tree distances to define kernels. In contrast to parametric models, such as coalescent, 
nonparametric approaches avoid the problem of model mis-specification, which leads to 
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potentially unreliable results. Our method demonstrated superior accuracy in outlier 
detection on simulated data, when compared to a previously published method. We also 
applied our method to a dataset of Apicomplexa genes, identifying a set of putative 
outliers. Our method for estimating tree distributions is implemented as the R package, 
kdetrees, and is available for download from CRAN. (Keywords: phylogenetics, 
nonparametric, gene trees, R package) 
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A central problem in systematic biology is the reconstruction of populations and 



species from numerous gene trees with varying levels of discordance (Brito and Edwards 



2009 Edwards 2009). Although there is a well-established understanding that discordant 



phylogenetic relationships will exist among independent gene trees drawn from a common 



species tree (Maddison 1997 Pamilo and Nei 1988 Takahata 1989), phylogenetic studies 
have only recently begun to shift away from single-gene and concatenated-gene estimates of 



phylogeny in favor of multi-locus methods (Carling and Brumfield 2008). These newer 
approaches focus on the role of genetic drift in producing patterns of incomplete lineage 



sorting and gene tree/species tree discordance, largely using coalescent theory (Degnan and 



Salter 2005 Rosenberg 2002, 2003). These theoretical developments have been used to 



reconstruct species trees from samples of estimated gene trees (Carstens and Knowles 2007 



Edwards et al. 2007 Maddison and Knowles 2006 Mossel and Roch 2007 [RoyChoudhury 



et al. 2008). 



Detecting concordance among gene trees is also a topic of interest. For example. 



Ane et al. (2007) developed a Bayesian method to estimate concordance among gene trees 



using molecular data from multiple loci. The method can produce estimated gene trees as 
well as an estimate of the proportion of the genome supporting a particular clade. 
However, a priori assumptions must be made about the degree and structure of 
concordance present in the gene trees. 

Although there is a tremendous amount of ongoing effort to develop better 
parametric models for gene tree distributions, the parametric framework has inherent 
difficulties. While parametric models typically make the most efficient use of a given data 
set when the model specified is correct, they achieve this efficiency by making many 
assumptions about the form of the distribution of gene trees. This can lead to erroneous 
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inferences when the model assumptions do not reflect reality. Given that many questions 
remain about the proper way to incorporate a number of important processes into a 
parametric model (e.g. geographic barriers to migration, or a population bottleneck), the 
problem of model mis-specification is very real. Nonparametric estimation avoids these 
modeling issues entirely, making very few assumptions about the structure of gene tree 
distributions. 

In this paper, we focus on the problem of identifying significant discordance among 
gene trees, as well as estimating the distribution of gene trees as a whole. This set of gene 
trees is assumed to consist mostly of "typical" (or "non-outlier") gene trees, which are 
assumed to be independently sampled from some distribution /. For example, gene trees 
which have evolved neutrally under a coalescent process. In addition, there are a smaller 
number of "outlier" gene trees which are sampled from a very different distribution /'. 
These genes are assumed to arise from less common evolutionary processes; for example, 
neofunctionalization, horizontal gene transfer, or periods of rapid molecular evolution. In 
addition, more mundane errors, such as incorrect sequencing, alignment, tree 
reconstruction, or annotation can also produce outlier trees in a data set. Our method 
produces a nonparametric estimate of the distribution / and also attempts to identify 
potential outlier gene trees which are probably not generated by /. Trees identified as 
outliers can then be inspected more closely for biologically interesting properties. In 
particular, identifying and removing outliers that violate model assumptions can improve 
the accuracy of inferences made from a collection of gene trees. 
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Related Work 



The method presented in this paper is not a statistical method for hypothesis testing, but 
rather for "discovering possible outliers" present in a given collection of orthologous genes. 
However, there has been significant work devoted to the development of statistical methods 
for testing hypotheses of discordance between the trees in a collection. The reviewed 
methods in Poptsova ( 2009[ ) are the following: (i) likelihood-based tests of tree topologies 



(Kishino-Hasegawa (KH) (Kishino and Hasegawa 1989), Shimodaira-Hasegawa (SH) 



(Shimodaira and Hasegawa 1999[ ), Approximately Unbiased (AU) tests (Shimodaira 2002)); 
(ii) tree distance methods (Robinson-Foulds (RF) and subtree pruning and regrafting 



(SPR) distances); and (iii) genome spectral approaches (bipartition (Lockhart et al. 1995) 
and quartet decomposition analysis ( Piaggio-Talice et al. [2004 )). 

The likelihood-based tests of tree topologies and tree distance methods are 
statistical hypothesis tests that detect significant incongruence between trees, i.e., they are 
testing the following hypotheses: 

i^o: A given species tree and a given gene tree are congruent. 
Hi'. A given species tree and a given gene tree are incongruent. 

The distinction between likelihood and distance based methods is in how they calculate the 
p-value of these hypotheses. The likelihood-based tests compare each gene tree with a 
species/reference tree using a likelihood value, to see if the incongruence is "statistically 



significant." These methods are also known as partition likelihood support (PLS) (Lee and 



Hugall Feb., 2003). Tree distance methods estimate the p-value of the hypotheses above by 



computing a distance between a reference tree and each gene tree. Holmes (2005) describes 



a framework for statistical hypothesis testing on trees based on tree distances using 
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distributions of phylogenetic trees (e.g. a posterior distribution or bootstrap resampling). 
Holmes also presents a statistical method to compare two sets of bootstrap sampling 



distributions, using the mean and variance of each distribution (Holmes 2005, Section 



4.4.1). A statistical nonparametric method for detecting significant discordance between 



two sets of trees via the supporting vector machines (SVMs) was introduced by Haws et al 



(2012). This is a nonparametric method for statistical testing of the hypotheses: 



Hq : Two sets of trees are drawn from the same distribution. 

Hi : Two sets of trees are not drawn from the same distribution. 

While likelihood-based tests assume that the species tree is known, genome spectral 
approaches do not use such a reference tree. Genome spectral methods summarize a set of 
gene trees with phylogenetic spectra (frequencies), such as splits or quartets. These 
frequencies can be used to approximate the distribution of gene trees, instead of producing 
a summarizing tree. Outlier trees can be identified by looking for trees whose highly 
supported features disagree with prevalent features in the spectra (Nepusz et al. 2010). 



A non-statistical approach for summarizing collections of gene trees is presented by 



Nye (2008). Treating each gene tree as a leaf node, a "meta-tree" is constructed where 
nodes correspond to phylogenetic trees; distances between nodes of the meta-tree 
correspond to distances between phylogenetic trees, and internal nodes correspond to gene 
trees with various branches collapsed. When using the RF distance, the nonparametric 
method proposed in this paper can be viewed as a numerical summarization of the 



meta-tree in (Nye 2008). 



Recently, de Vienne et al. (2012) developed a statistical nonparametric method to 
detect outlier trees from the set of gene trees. They first convert gene trees into vectors in 
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a multi-dimensional Euclidean space and then apply Principal Components Analysis 
(PC A) directly to these vectorized gene trees in order to find outliers. Included in our 
results is a simulation study comparing our non-parametric method with de Vienne's 
Phylo-MCOA method. 



Overview of Our Approach 

Let Tn denote the set of all tree topologies on n taxa (which we call tree space). We consider 
the trees to be unrooted, but rooted trees can be treated similarly. Our main object of 
study is a sample, {Tj}^^, of trees (gene trees) mostly drawn from a distribution / on 
Tn- If n is large enough that \Tn\ ^ N then many tree topologies in the sample may have 
low empirical frequency. In this case, / cannot be estimated well by assigning f(T) to be 
the empirical frequency of T in the sample. On the other hand, if / corresponds to a model 
such as coalescent, it is reasonable to expect that topologies "close" to many observed trees 
will have a higher likelihood than topologies "far away" from the observed trees. 

Kernel density estimation is a nonparametric technique to estimate a distribution 
that generated a sample, by leveraging the fact that points close to sample points tend to 
have higher likelihood than distant outlier points (under adequate assumptions on the 
distribution). Kernel density estimation can be viewed as a refined version of 
histogram-based estimation of a density. As the term density suggests, kernel density 
estimation is typically formulated for continuous variables over M''. However, similar 
methods can also be devised to estimate distributions over a finite set such as tree space. A 
key ingredient is the ability to measure similarity between trees. Fortunately, research in 
phylogenetics has produced several classical distances on tree space, such as the 



dissimilarity map distance (Buneman 1971), the topological dissimilarity distance measure 
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(Steel and Penny 1993), the Robinson-Foulds distance (Robinson and Foulds 1981), and 



the quartet distance (Estabrook et al. 1985). 

Our method uses existing tree distances to estimate a tree distribution by 
mimicking kernel density estimation. Our main goal is to identify regions of %i which have 
high probability, as well as observed trees with markedly low estimated probability. These 
low-probability trees are potentially outlier trees; i.e., trees having evolutionary histories 
unlikely to have arisen from the same model that generated the non-outlier trees. Our 
approach is nonparametric, which makes it quite general, and avoids problematic issues 
such as model design and selection that one encounters when using a parametric model 
(such as the coalescent). Unfortunately, using a small sample to learn an arbitrary 
distribution on tree space is inherently difficult, especially as the dimension of Tn grows, 
and we do not expect to learn the tree distribution with high accuracy for every tree 
topology. However, estimates of the density in regions where the probability is high can be 
quite good. 



Results 

We present the software package kdetrees for nonparametric estimation of tree 
distributions and detection of outlier trees. The software takes a sample of trees, 
Ti, . . . , T/v, in Newick format as input, and estimates for each tree a "score" based on a 
nonparametric estimator of the tree density. The kdetrees package can also attempt to 
improve the estimator / by applying a greedy approach to iteratively remove potential 
outlier trees from the computation. The output is a list of scores, which are the density 
estimates of the observed trees. Plots summarizing the findings can be produced, as well as 
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a list of putative outlier trees. 



The kdetrees package is written in R (R Development Core Team 2011), and makes 



use of some functions in the packages ape and phangorn (Paradis et al. 2004; Schliep 



2011). It has been tested on the Windows, OSX, and Linux platforms, and is released 
under the General Public License. The software is available for download from CRAN, or 
from the first author's website at http : //github . com/grady/kdetrees/ 



Simulation Results 
Simulations were run to compare the performance of our method against that of 



Phylo-MCOA (de Vienne et al. 2012). Phylo-MCOA is a R package which identifies putative 
outlying genes in a data set. The simulation also examined the performance of two tree 
distances, the topological dissimilarity and the dissimilarity map. (See the Choice of tree 
distance section in Materials and Methods for details.) We studied two different data 
generating methods for the simulated non-outlier trees: a "single" distribution of coalescent 
trees contained within one species tree, and a "mixed" distribution contained within 5 
species trees (i.e., the trees are generated by a mixture of five coalescent distributions). 
The variability of the coalescent trees is determined by the effective population size, the 
final parameter studied in the simulation. The proportion of the simulated data sets where 
each method correctly identified an added outlier tree is illustrated in Figure [T] 

We ran a second simulation studying the difference between the score distributions 
of outlier trees and non-outlier trees, as the ability of our method to reliably detect 
outlying trees depends on a tendency by outlier trees to produce scores significantly lower 
than the scores of non-outlier trees. The results are presented in Figure [2] We found that 
while there is some overlap between the score distributions, the distribution of scores for 
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outlier trees lies significantly below that of non-outlier trees. 



Application to Apicomplexa Data Set 



We applied our method to the data set presented in Kuo et al. (2008), which consists of a 
set trees generated from of 268 orthologous sequences from eight species of protozoa. When 
employing the dissimilarity map method, our method identified several putative outlier 
trees. Closer inspection of these trees suggests that these trees correspond to questionable 
sequence alignments which likely arose either from the inclusion of non-homologous genes in 
the data set, or from poor sequence annotations. The list of putative outlier genes selected 
by kdetrees is presented in Table [1} with additional discussion in Supplemental Table S.l 



Trees identified as putative outliers are plotted in Supplemental Figures |S.3[ |S.4[ and |S.5 



Running Time. — The running time of our kdetrees method also improves on that of 
Phylo-MCOA by a factor of approximately 20, when applied to the Apicomplexa data set. 
On a 2.3GHz MacBook Pro with 16GB of memory Phylo-MCOA took approximately 
8300ms to analyze the data, while kdetrees ran in approximately 350ms. 



Discussion 

Our proposed new method is motivated by the fact that existing methods of phylogenetic 
analysis and tree comparisons are not adequate for genomic scale phylogenetic analysis, 
particularly in cases of certain non-canonical evolutionary phenomena. In particular, we 
address the challenge posed by hybridization and lateral gene transfer, i.e., cases of gene 
flow between two lineages segregated and later reunited. As we observed for the 
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Apicomplexa data set, our approach also serves as a means of identifying problematic gene 
trees which arise from experimental artifacts such as misannotations or misalignments. The 
scenario in our mixed coalescent distribution simulation — where the non-outlier trees are 
sampled from an unknown mixture of distributions — cannot be handled by other methods, 
with the possible exception of the genome spectral methods. However, the genome spectral 
methods ignore possible statistical dependencies between different feature spectra. In 
contrast, we propose analyzing a collection of gene trees without reducing gene trees to 
summarizing information. 

In this paper we mimic the kernel density estimator, but without normalizing the 
estimated distribution. Normalizing the estimator requires that we estimate the partition 
function, that is, the summation of the estimator over all possible trees in tree space. This 
summation is particularly difficult since the space of phylogenetic trees is not a Euclidean 
space, but is a union of lower dimensional polyhedral cones which do not themselves form a 
vector space dBillera et al?][200T| ). (See |Grunbaim| ( |2003| ) , [Sturmfels] ( [l996| , and [Zligli^ 



(1995) for a detailed description of polyhedral cones.) 

The estimator requires us to select a kernel function, k. The kernel function defines 
how each observed tree contributes to the density estimate. The general form of k we use is 
the generalized Gaussian kernel, k{T,T') = exp[—{d{T,T')/hY], where T, T' G 7^. This 
class of kernel functions also requires that we select a tree distance, d(T,T'), a shape 
parameter, 6, and a bandwidth, h. The function d measures a distance or dissimilarity 
between pairs of trees. Ideally, d should define a metric on tree space, but we do not 
strictly require this. (See Materials and Methods for additional details.) 

For the choice of distance measures between two trees, we implemented the 



dissimilarity map distance (Buneman 1971) and the topological dissimilarity distance 
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measure (Steel and Penny 1993), primarily due to their simplicity. The use of other tree 
distances, such as RF distance or quartet distance, is completely compatible with our 
method; however, we did not implement them at this time since they are more 
computationally complex. The running time of kdetrees is dominated by the tree 
vectorization and pairwise-distance calculations, and so a choosing a computationally 
feasible tree distance is vital. 

The simulation results suggest that the dissimilarity map distance measure is 
superior to the topological dissimilarity distance measure. This result was expected, since 
the dissimilarity map distance measure incorporates more information (the branch lengths) 
than the topological dissimilarity distance measure. Future work should investigate the 
behavior of other distance measures; in particular, the geodesic distances described by 



Billera et al. (2001) and implemented by Owen and Provan (2011) appear to be a 



promising extension of the kdetrees method. 



Simulations 

The results of our simulations were generally promising. Our method was able to identify 
the outlier tree at least as reliably as Phylo-MCOA in every situation tested. In particular, 
our method was modestly successful in the mixed-distribution situation — a situation that is 
quite difficult — while Phylo-MCOA almost completely failed. 

In the single-distribution simulations with an effective population size of 500, the 
variance of the coalescent trees is minimal, and kdetrees was able to reliably detect the 
outlier tree. Phylo-MCOA's performance was good when using the branch length 
information, but the topology-only method was only marginally successful. As the effective 
population size was increased, the performance of all methods declined, but kdetrees 
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(using the dissimilarity map distance) retained the best performance. 

The mixed-distribution simulations present a greater challenge to the methods. This 
situation, in particular, is very difficult to tackle with parametric methods. We found that 
Phylo-MCOA was almost never able to find the outlier tree, while kdetrees was able to 
retain most of its performance, especially when using the dissimilarity map distance. 

The second simulation set compared the distribution of scores for outlier trees to 
the scores of non-outlier trees. Although the distributions are not completely distinct, it is 
clear that the outlier trees tend to have scores smaller than the majority of non-outlier 
trees. Since the outlier trees were generated as completely random coalescent trees, there 
will inevitably be trees generated which have structure similar to the non-outlier trees, 
simply by chance, and this accounts for some of the overlap between the distributions. 
With real data, such trees would correspond to genes which have some exotic history, but 
nonetheless appear to have a phylogeny substantially similar to the rest of the genes in the 
genome. In this case, it is ambiguous whether or not such a gene should be legitimately 
classified as an outlier. 



Apicomplexa Data 



The phylum Apicomplexa contains many important protozoan pathogens Levine (1988), 
including the mosquito-transmitted Plasmodium spp., the causative agents of malaria; T. 
gondii, which is one of the most prevalent zoonotic pathogens worldwide; and the 
water-born pathogen Cryptosporidium spp. Several members of the Apicomplexa also 
cause significant morbidity and mortality in both wildlife and domestic animals. These 
include Theileria spp. and Babesia spp., which are tick-borne haemoprotozoan ungulate 
pathogens, and several species of Eimeria, which are enteric parasites that are particularly 
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detrimental to the poultry industry. Due to their medical and veterinary importance, 
whole genome sequencing projects have been completed for multiple prominent members of 
the Apicomplexa. 



Our analysis of the data set presented in Kuo et al. (2008), which consists of 268 
orthologous genes from seven species of Apicomplexa and one out group ciliate, 
Tetrahymena thermophelia, identified several putative outlier trees. Supplementary Figure 



S.l| illustrates the scores that the method generated for the trees in the data set. 

Of the trees identified by the dissimilarity map method, it seems that most are 
likely attributable to either incorrect annotation or the inclusion of non-orthologous genes. 
The most common culprits were sequences from Eimeria tenella (Et). (See Table [l] and 



Supplementary Table S.l for more details.) The results from the topological dissimilarity 
method were less decisive. Here there were no clearly identifiable problems with the trees 
or sequences in most cases identified as putative outliers. This result is similar to that 
found in the simulation studies, suggesting that the incorporation of the branch length 
information by the dissimilarity map provides superior results. 

These results demonstrate the potential applicability of the kdetrees method to 
the curation of genetic data sets, by providing a simple tool for highlighting sequences or 
alignments that may be of further interest. Although in this situation we were only able to 
identify potential data quality issues, we feel that the method may also be of significant use 
in exploratory data analysis and hypothesis generation. 
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Given an independent and identically distributed sample of trees Ti, . . . , T/v, we propose a 
nonparametric estimator of the distribution that generated the sample with the form 



Here k, the kernel function, is a non-negative function defined on pairs of trees which 
measures how "similar" two trees are. For our approach, we do not require A; to be a kernel 
in a strict statistical sense. 

In kdetrees we have implemented a kernel of the form. 



where d is a distance metric on trees, 5 > is a "shape" parameter, and the hi > are a 
set of "bandwidth" parameters that control how tightly each contribution k(T,Ti) will be 
centered on Tj. Allowing the bandwidth to vary with the sample points, Tj, is called an 
adaptive bandwidth method, alternatively the bandwidth can be set to a constant value for 



In general, we can remove the symmetry and triangle inequality requirements for d, 
and it is possible that the sum over tree space, X^tsT ^(^' ^')' '^^^^ vary with T'. Ideally, 
we would remedy this issue by normalizing k{-,T') so that ^^^^t- ^(7", 7"') = 1- (This is the 
case most analogous to kernel density estimation.) However, for reasonable choices of d, 

k{T, T') is not expected to vary significantly for most T", so we ignore this issue rather 
than trying to compute or estimate each ^^^^ /c(T, Tj). 



i=l 




aU Ti. 
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Since the ultimate goal is to detect outlier trees, Tj, which are not actually drawn 
from the true distribution /, we are most concerned with estimating the density at the 
observed sample points. In this context, it makes sense to use a "leave-one-out" estimator 
which excludes the contribution of the point in question from the density estimate. 



9{Tj) oc ^ 



This estimator simply transforms probability estimates via A{x) = N{x — c)/ (A^ — 1) for 
some c. Assuming the sample is drawn i.i.d. from a distribution /, for fixed d and 6, both 
g{T) and f(T) (once normalized) will converge to / as — )■ oo, so long as the hi{N) — 0. 
This result follows immediately from the finiteness of tree space. 

This idea can be extended to also exclude the contribution of a number of trees 
which are determined to be outliers. Since the magnitude of g{Tj) can be used as a measure 
of evidence for Tj being an outlier, kdetrees can iteratively remove from the calculation 
the contribution of the tree which minimizes g{Tj), and recompute the estimator g with the 
reduced sample. This process can be iterated to remove multiple putative outliers. 

Choice of tree distance. — In our approach, trees can be incorporated into a statistical 
framework by converting them into a numerical vector format based on a distance matrix 
or map. These vectorized trees can then be analyzed as points in a multi-dimensional space 



where the distance between trees increases as they become more dissimilar (Graham and 



Kennedy 


2010; 


Hillis et al. 


2005 



For the choice of d, we propose distances derived from two different distances on 
trees: dissimilarity map d^is and topological dissimilarity map dtop- The dissimilarity map 
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distance measure between two trees is the euclidean distance, 



ddis(T',T) = \ \vdis(T) - Vdis{T')\\2, 
where VdisiT) G M.^^^ is the vector whose (z, j)-th entry is the sum of branch lengths on the 



path between leaves i and j in T. The dissimilarity map distance is studied in Buneman 



(1971). The topological dissimilarity map distance measure between two trees is the 
euclidean distance, 

dtop{T',T) = \\vtopiT) - VtopiT')\\2, 

where VtopiT) G M.^'-^ is the vector whose (z, j)-th entry counts the number of edges between 
leaves i and j in T. The topological dissimilarity distance measure, also called the path 



difference, was studied in Steel and Penny (1993). An example calculation of both Vdis and 



Vtop is shown in Figure S.2 



Estimation of bandwidth. — The estimator g depends crucially on the choice of the 
bandwidth parameter h. We employ a nearest-neighbor approach to estimate an adaptive 
bandwidth for each sample point. To estimate the bandwidth for a point T,, we use the 
distance to the m-th closest sample point. This approach has the effect of causing the 
kernels to be concentrated in areas where there is a lot of data, and diffuse in the tails of 
the distribution. In the current version of kdetrees m is defaulted to be 20% of the sample 
size, a heuristic value chosen based on simulation results. 

Alternatively, the bandwidth can be set to a constant value for all Tj. In order to do 
this we must find a way to choose an optimal value for the bandwidth h. We experimented 
with a constant bandwidth chosen by estimating the partition function Zh = YlxdhiT) 
using a random sample of trees. However, it seems that we tend to under-estimate the 
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bandwidth h and the results are not as robust as in the case of the adaptive bandwidth. 

Running time. — The computation time of kdetrees is dominated by the step where 
pairwise tree distances are calculated. For trees, each with n taxa, this step takes 
Oin^N'^) operations. Computation of the scores themselves is 0{N'^), and if k trees are 
removed the iterative tree-removal algorithm requires 0{kN'^) operations. 



Simulation 

Our first simulation evaluated the performance of our method and compared it with the 



results obtained with Phylo-MCOA. The Python library DendroPy (Sukumaran and Holder 



2010) was used to generate the data sets used in the simulations. Six species trees (see 



Figure |3|) were generated to contain the non-outlier trees sampled from Kingman's 



coalescent model (Kingman 1982). A second set of coalescent trees (with the same effective 
population size as the species trees) was generated to serve as "outlier" trees. 

For each simulation replicate we generated 100 coalescent trees contained within the 
species tree(s), and appended to this set one outlier tree. For the "single" distribution 
simulation, all 100 non-outlier trees were contained within the top-left tree in Figure [3j For 
the "mixed" distribution simulation, each of the remaining 5 trees in Figure [3] was used to 
generate 20 contained coalescent trees. The effective population size for the generation of 
these contained trees was one of the parameters studied in the simulation. Figure |4] 
summarizes the simulation processes. 

A second simulation compared the distribution of outlier tree scores to the 
distribution of non-outlier tree scores. Five hundred coalescent trees were generated within 
the species tree in the top-left of Figure |3| Thee effective population size for these trees 
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was 1500, which corresponds to a situation where there is a moderate amount of variabihty 
in the coalescent trees. Scores for the trees in this data set were computed, and an estimate 
of the distribution of these non-outher scores is shown at the bottom of Figure |2j 

To estimate the distribution of non-outher scores, a single random coalescent tree 
was appended to the set of trees previously generated, and the score for this new outher 
tree was computed. This process was repeated many times (always using the same set of 
non-outlier trees) to generate a sample of outlier tree scores. An estimate of the 
distribution of these outlier tree scores is shown at the top of Figure [2j 

Code and documentation for the simulations is included in the kdetrees software 
package download. The Python scripts used to generate the simulated data sets are also 
included. 



Apicomplexa Data 



The data set from Kuo et al. (2008) consists of gene trees reconstructed from the following 



sequences: Babesia bovis (Bb) (Brayton et al. 2007) from GenBank (GenBank accession 



numbers AAXT01000001-AAXT01000013), Cryptosporidium parvum (Cp) (Abrahamsen 



et al. 2004) from CryptoDB.org (Heiges et al. 2006), Eimeria tenella (Et) from 



GeneDB.org (Hertz- Fowler et al. 2004), Plasmodium falciparum (Pf) (Gardner et al. 2002) 



and Plasmodium vivax (Pv) from PlasmoDB.org (Bahl et al. 2003), Theileria annulata (Ta 



(Pain et al. 2005) from GeneDB.org ( Hertz- Fowler et al. 2004), and Toxoplasma gondii 



(Tg) from Toxo-DB.org (Gajria et al. 2008). A free-living ciliate, Tetrahymena thermophila 



(Tt) (Eisen et al. 2006), was used as the outgroup. 
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Table 1. Apicomplexa gene sets identified as outliers by kdetrees. 



Nn 1 




rTi 1 n n"!"! r~»Ti a 1 A n n i~\ on 

X Llill^ LiUlifXl iT^lillU tCXLlUil 


488 


PF08_0086 


RNA-binding protein, putative 


497 


PF13_0228 


40S ribosomal subunit protein S6, putative 


515 


PFA0390W 


DNA repair exonuclease, putative 


546 


PFF0285C 


DNA repair protein RAD50, putative 


547 


PFL1345C 


Radical SAM protein, putative 


641 


PFE0750C 


hypothetical protein, conserved 


660 


PF10_0043 


ribosomal protein L13, putative 


662 


PF11_0463 


coat protein, gamma subunit, putative 


728 


MAL13P1.22 


DNA hgase 1 


747 


PFB0550W 


Peptide chain release factor subunit 1, putative 


773 


PFF0120W 


putative geranylgeranyltransferase 


780 


PFD0420C 


flap exonuclease, putative 



Based on geneset designations in Kuo et al. (2008). 
^Geneset represented by GenelD for Plasmodium falciparum. 
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Figure 1. Summary of simulation results comparing performance of kdetrees and 
Phylo-MCOA for various values of the effective population size. Shown is the proportion of 
simulated data sets in which the methods identified the outlier tree. At left are the "single" 
contained coalescent simulations, with the non-outlier trees all contained within a single 
species tree. At right are results from a "mixed" simulation, with the non-outlier trees 
generated from a mixture of 5 species trees. 

Figure 2. Kernel density estimates of the observed distribution of tree scores. The 
"coalescent" scores are for contained coalescent trees generated within a fixed species tree 
(bottom). A single random outlier tree is added to this data set and its score computed. 
This process is replicated to generate the sample of "outlier" tree scores (top). Lines and 
dots represent the 5%-95% quantiles and the median, respectively. 

Figure 3. The species trees used to generate gene trees under the coalescent model for the 
simulation experiments. At top-left is the tree used for the "single" coalescent distribution 
simulations, while the other trees are used in the "mixed" simulations. 

Figure 4. Summary of the design of the simulation comparing kdetrees and Phylo-MCOA. 
(See Figure [3] for a plot the species tree used.) The non-outlier trees were generated under 
two models, the "single" coalescent model, and the "mixed" model. This process was 
replicated many times, and the count of successful identifications for each method recorded. 

Figure 5. Summary of the simulation design for the simulation comparing the tree score 
distributions for outlier trees and non-outlier trees. 
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Table S.l. Analysis of Apicomplexa gene-sets identified as outliers 



Gene ID Functional Annota- Analysis 

tion 



PF08_0086 



PF13_0228 



PFA0390W 



PFF0285C 



PFL1345C 



PFE0750C 



PF11_0463 



PFB0550W 



PFF0120W 



PFD0420C 



RNA-binding pro- 
tein, putative 

40S ribosomal sub- 
unit protein S6, pu- 
tative 

DNA repair exonu- 
clease, putative 



DNA repair protein 
RAD50, putative 



Radical SAM pro- 
tein, putative 
hypothetical pro- 
tein, conserved 

ribosomal protein 
L13, putative 

coat protein, 
gamma subunit, 
putative 
DNA ligase 1 



Peptide chain re- 
lease factor subunit 
1, putative 

putative geranyl- 
geranyltra nsfgrase 
flap cxonuclease, 
putative 



Significant sequence lengtlT^^^^^ [Lb'l a.a. for Ta vs 1075a. a. 
for Pf). Generally good sequence alignment in one region of 100 
residues; otherwise, alignment is poor. 

Tt sequence much longer than all others; long N-terminal and 
C-terminal extensions. Very good alignment in blocks, but with 
lengthy insertions for outgroup Tt. Possible incorrect annotation 
of Tg sequence. 

Short sequences for Et and Cp. Several homopolymer stretches 

in Et. Modest to good alignment in nniltiple blocks, Et being an 
exception in several regions. Possible incorrect annotation of Et 
sequence. 

Poor alignment in general. Three modest blocks (50-100 aa) of 
reasonable sequence alignment. Et sequence contains long ho- 
mopolymeric stretches. Pf and Pv have long insertions that might 

be translated introns. 

Relatively short sequence for Et. Homopolymeric stretch at N- 
terminus of Tg. Modest to good alignment in blocks. 
Large difference in sequence lengths; 269 residues for Et vs. 848 for 
Pf. Central region with modest to good alignment; Et exhibited 
poor sequence identity suggestion it might not be a homologue. 
80 residue N-terminal extension in Tg. Good sequence alignment, 
with Tt (outgroup) being an exception. Tt sequence might not 
be a homologue. 

Multiple homopolymer stretches in Et sequence. Generally good 
alignment for all but Et; sequence might not be homologous. 

Homopolymer stretches in Et sequence with poor alignment to 
other sequences. Et sequence might be incorrectly annotated 
and/or might not be homologous. 

Short sequence for Et (132 residues), with long homopolymer 
stretch. Other sequences are approximately 425 a.a. in length. 
Generally good alignment, even for Et over a short region ( 50 
residues). Possible incorrect annotation of Et sequence. 
Two homopolymer stretches (serine) in Et sequence. Moderately 
good alignment. Possible incorrect annotation of Et sequence. 
Very discrepant sequence lengths; 179 a.a. for Et vs. 2213 a.a. 
for Tt. All other sequences 500 — 600 residues in length. Good 
alignment over several regions, although sequence for Et is absent 
in portions of these regions. Very long N-terminal extensions and 
insertions in Tt sequence. Possible incorrect annotations for Et 
and Tt. 



Pf = Plasmodium falciparum, Pv = Plasmodium vivax, Bb = Babesia bovis, Ta = Theileria annulata, Et = 
Eimeria tenella, Tg = Toxoplasma gondii, Gp = Cryptosporidium parvum, and Tt = Tetrahymena 
thermophila (outgroup). 
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Figure S.l. Summary of tree scores for the Apicomplexa data set. In the top row the 

scores of individual trees are shown. "Tree index" refers to the ordering of the trees in the 
input files. In the bottom row, the scores are summarized as a histogram. In the left 
column are the results computed with branch-length information, while the topology-only 
results are shown at right. 
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Figure S.2. Schematic of how trees are converted to vectors. Numbers on branches in the 
unrooted tree are branch lengths. In this example, the tree is first converted to either a 
branch length-based dissimilarity map (matrix of distances between tips) or topological 
dissimilarity maps (matrix of number of edges between tips). Moving from left to right 
across rows in one half of a matrix, values are placed into a single column to yield a vector 
of distances between tips in the tree. 
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Score: 0.358226541797286 



Score: 0.357573359515257 



Figure S.3. Plots of Apicomplexa gene trees identified as outliers. (Continued on next 
figure.) 
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Figure S.4. Plots of Apicomplexa gene trees identified as outliers. (Continued on next 
figure.) 
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Figure S.5. Plots of Apicomplexa gene trees identified as outliers. (Continued from 
previous figures.) 



